Gradient Boosted Decision Tree


Roadmap 


1. Embedding Numerous Features: Kernel Models
2. Combining Predictive Features: Aggregation Models

   Lecture 10: Random Forest
   bagging of randomized C&RT trees with automatic validation and feature selection

   Lecture 11: Gradient Boosted Decision Tree
   • Adaptive Boosted Decision Tree
   • Optimization View of AdaBoost
   • Gradient Boosting
   • Summary of Aggregation Models

3. Distilling Implicit Features: Extraction Models
Adaptive Boosted Decision Tree

From Random Forest to AdaBoost-DTree

function RandomForest(D)
  for t = 1, 2, ..., T
    1. request size-N' data $\tilde{D}_t$ by bootstrapping with D
    2. obtain tree $g_t$ by Randomized-DTree($\tilde{D}_t$)
  return G = Uniform({$g_t$})

function AdaBoost-DTree(D)
  for t = 1, 2, ..., T
    1. reweight data by $u^{(t)}$
    2. obtain tree $g_t$ by DTree(D, $u^{(t)}$)
    3. calculate ‘vote’ $\alpha_t$ of $g_t$
  return G = LinearHypo({($g_t$, $\alpha_t$)})

need: weighted DTree(D, $u^{(t)}$)
Weighted Decision Tree Algorithm

Weighted Algorithm
minimize (regularized) $E_{\text{in}}^{u}(h) = \frac{1}{N} \sum_{n=1}^{N} u_n \cdot \text{err}(y_n, h(x_n))$

if using existing algorithm as black box (no modifications), to get $E_{\text{in}}^{u}$ approximately optimized......

‘Weighted’ Algorithm in Bagging
weights u expressed by bootstrap-sampled copies
(request size-N' data $\tilde{D}_t$ by bootstrapping with D)

A General Randomized Base Algorithm
weights u expressed by sampling proportional to $u_n$
(request size-N' data $\tilde{D}_t$ by sampling $\propto u$ on D)

AdaBoost-DTree: often via AdaBoost + sampling $\propto u^{(t)}$ + DTree($\tilde{D}_t$) without modifying DTree
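The sampling step can be sketched as follows. This is a minimal illustration, assuming NumPy; the helper name `sample_by_weight` and the toy data are hypothetical and not part of the lecture. It draws a size-N' data set with probability proportional to the weights, so the base algorithm can stay an unmodified black box.

```python
import numpy as np

def sample_by_weight(X, y, u, n_samples, rng):
    """Draw n_samples examples with probability proportional to u (with replacement)."""
    p = np.asarray(u, dtype=float)
    p = p / p.sum()                      # normalize weights into probabilities
    idx = rng.choice(len(p), size=n_samples, replace=True, p=p)
    return X[idx], y[idx], idx

rng = np.random.default_rng(0)
X = np.arange(5, dtype=float).reshape(-1, 1)
y = np.array([1, -1, 1, -1, 1])
u = np.array([0.0, 1.0, 1.0, 2.0, 0.0])  # examples 0 and 4 carry zero weight
X_t, y_t, idx = sample_by_weight(X, y, u, n_samples=1000, rng=rng)
counts = np.bincount(idx, minlength=len(u))
print(counts)                            # zero-weight examples are never drawn
```

Examples with larger weight appear more often in the sampled set, which is how the weights reach a base learner that only accepts unweighted data.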
Weak Decision Tree Algorithm

AdaBoost: votes $\alpha_t = \ln \blacklozenge_t = \ln \sqrt{\frac{1-\epsilon_t}{\epsilon_t}}$ with weighted error rate $\epsilon_t$

if fully grown tree trained on all $x_n$
$\Rightarrow$ $E_{\text{in}}(g_t) = 0$ if all $x_n$ different
$\Rightarrow$ $E_{\text{in}}^{u}(g_t) = 0$
$\Rightarrow$ $\epsilon_t = 0$
$\Rightarrow$ $\alpha_t = \infty$ (autocracy!!)

need: pruned tree trained on some $x_n$ to be weak
• pruned: usual pruning, or just limiting tree height
• some: sampling $\propto u^{(t)}$

AdaBoost-DTree: often via AdaBoost + sampling $\propto u^{(t)}$ + pruned DTree($\tilde{D}$)
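To see why a zero-error tree leads to ‘autocracy’, the vote formula can be evaluated for shrinking error rates. This is only a small numerical sketch; the function name `adaboost_vote` is an assumption made for illustration.

```python
import math

def adaboost_vote(eps):
    """alpha_t = ln sqrt((1 - eps) / eps) for a weighted error rate eps in (0, 1)."""
    return 0.5 * math.log((1.0 - eps) / eps)

# The vote grows without bound as the weighted error rate approaches 0.
for eps in (0.4, 0.1, 1e-3, 1e-9):
    print(f"eps = {eps:g}  ->  alpha = {adaboost_vote(eps):.3f}")
```

A weighted error rate of exactly 1/2 gives a zero vote, and anything above 1/2 gives a negative vote.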
AdaBoost with Extremely-Pruned Tree

what if DTree with height $\le 1$ (extremely pruned)?

DTree (C&RT) with height $\le 1$: learn branching criteria

$b(x) = \operatorname{argmin}_{\text{decision stumps } h(x)} \sum_{c=1}^{2} |D_c \text{ with } h| \cdot \text{impurity}(D_c \text{ with } h)$

if impurity = binary classification error, just a decision stump, remember? :-)

AdaBoost-Stump = special case of AdaBoost-DTree
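A height-1 tree with classification error as impurity can be sketched as a brute-force decision stump search. The code below is an illustrative assumption, not the lecture's implementation: the names `fit_stump` and `predict_stump` are hypothetical, and the search accepts example weights u directly (rather than via sampling) purely for convenience.

```python
import numpy as np

def fit_stump(X, y, u):
    """Return (weighted_error, feature, threshold, direction) of the best
    stump h(x) = direction * sign(x[feature] - threshold)."""
    best = (np.inf, 0, 0.0, 1)
    for i in range(X.shape[1]):
        xs = np.unique(X[:, i])
        # candidate thresholds: below all points, plus midpoints between neighbours
        thetas = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
        for theta in thetas:
            pred = np.where(X[:, i] > theta, 1, -1)
            for s in (1, -1):
                err = np.sum(u * (s * pred != y)) / np.sum(u)
                if err < best[0]:
                    best = (err, i, theta, s)
    return best

def predict_stump(stump, X):
    _, i, theta, s = stump
    return s * np.where(X[:, i] > theta, 1, -1)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([-1, -1, 1, 1])
u = np.full(4, 0.25)
stump = fit_stump(X, y, u)
print(stump, predict_stump(stump, X))
```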
Fun Time

When running AdaBoost-DTree with sampling, suppose we get a decision tree $g_t$ that achieves zero error on the sampled data set $\tilde{D}_t$. Which of the following is possible?

1. $\alpha_t < 0$
2. $\alpha_t = 0$
3. $\alpha_t > 0$
4. all of the above





Reference Answer: 4

While $g_t$ achieves zero error on $\tilde{D}_t$, $g_t$ may not achieve zero weighted error on $(D, u^{(t)})$, and hence $\epsilon_t$ can be anything, even $\ge \frac{1}{2}$. Then $\alpha_t$ can be $\le 0$.
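Optimization View of AdaBoost

Example Weights of AdaBoost

$u_n^{(t+1)} = u_n^{(t)} \cdot \blacklozenge_t$ if incorrect, $u_n^{(t)} / \blacklozenge_t$ if correct
$= u_n^{(t)} \cdot \blacklozenge_t^{-y_n g_t(x_n)} = u_n^{(t)} \cdot \exp(-y_n \alpha_t g_t(x_n))$

$u_n^{(T+1)} = u_n^{(1)} \cdot \prod_{t=1}^{T} \exp(-y_n \alpha_t g_t(x_n)) = \frac{1}{N} \cdot \exp\Bigl(-y_n \sum_{t=1}^{T} \alpha_t g_t(x_n)\Bigr)$

• recall: $G(x) = \mathrm{sign}\Bigl(\sum_{t=1}^{T} \alpha_t g_t(x)\Bigr)$
• $\sum_{t=1}^{T} \alpha_t g_t(x)$: voting score of $\{g_t\}$ on x

AdaBoost: $u_n^{(T+1)} \propto \exp(-y_n(\text{voting score on } x_n))$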
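The product form of the weights can be checked numerically. The sketch below uses made-up labels, hypothesis outputs, and votes (all hypothetical, generated at random) and compares the iterative update against the closed form based on the voting score.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 6, 4
y = rng.choice([-1, 1], size=N)
g = rng.choice([-1, 1], size=(T, N))      # g[t, n]: hypothetical outputs g_t(x_n)
alpha = rng.normal(size=T)                # hypothetical votes alpha_t

u = np.full(N, 1.0 / N)                   # u^(1) = 1/N
for t in range(T):
    u = u * np.exp(-y * alpha[t] * g[t])  # u^(t+1) = u^(t) * exp(-y_n alpha_t g_t(x_n))

score = alpha @ g                         # voting score sum_t alpha_t g_t(x_n)
closed_form = np.exp(-y * score) / N
print(np.allclose(u, closed_form))
```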
Voting Score and Margin

linear blending = LinModel + hypotheses as transform + constraints (struck through on the slide)








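$G(x_n) = \mathrm{sign}\Bigl(\underbrace{\sum_{t=1}^{T} \alpha_t\, g_t(x_n)}_{\text{voting score}}\Bigr)$, with $\alpha_t$ in the role of $w_i$ and $g_t(x_n)$ in the role of $\phi_i(x_n)$

and hard-margin SVM margin $= \frac{y_n \cdot (w^T \phi(x_n) + b)}{\|w\|}$, remember? :-)

$y_n(\text{voting score})$ = signed & unnormalized margin

want $y_n(\text{voting score})$ positive & large
$\Leftrightarrow$ $\exp(-y_n(\text{voting score}))$ small
$\Leftrightarrow$ $u_n^{(T+1)}$ small

claim: AdaBoost decreases $\sum_{n=1}^{N} u_n^{(t)}$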
AdaBoost Error Function

claim: AdaBoost decreases $\sum_{n=1}^{N} u_n^{(t)}$ and thus somewhat minimizes

$\sum_{n=1}^{N} u_n^{(T+1)} = \frac{1}{N} \sum_{n=1}^{N} \exp\Bigl(-y_n \sum_{t=1}^{T} \alpha_t g_t(x_n)\Bigr)$

linear score $s = \sum_{t=1}^{T} \alpha_t g_t(x_n)$
• $\text{err}_{0/1}(s, y) = [\![ ys \le 0 ]\!]$
• $\widehat{\text{err}}_{\text{ADA}}(s, y) = \exp(-ys)$: upper bound of $\text{err}_{0/1}$, called exponential error measure

[Figure: plot of the 0/1 and exponential error measures]

$\widehat{\text{err}}_{\text{ADA}}$: algorithmic error measure by convex upper bound of $\text{err}_{0/1}$
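The upper-bound relation can be spot-checked on a grid of values of ys. This is a minimal sketch assuming NumPy; the grid range is arbitrary, and the 0/1 error is coded as counting ys = 0 as an error.

```python
import numpy as np

ys = np.linspace(-3.0, 3.0, 13)           # values of the product y * s
err_01 = (ys <= 0).astype(float)          # 0/1 error: 1 when the sign is wrong (or ys = 0)
err_ada = np.exp(-ys)                     # exponential error measure

for v, e01, eada in zip(ys, err_01, err_ada):
    print(f"ys = {v:+.1f}   err_01 = {e01:.0f}   err_ADA = {eada:.3f}")
```

On every grid point the exponential error is at least the 0/1 error, so driving the former down also pushes the latter down.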





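Gradient Descent on AdaBoost Error Function

recall: gradient descent (remember? :-)), at iteration t

$\min_{\|v\|=1} E_{\text{in}}(w_t + \eta v) \approx \underbrace{E_{\text{in}}(w_t)}_{\text{known}} + \underbrace{\eta}_{\text{given positive}} v^T \underbrace{\nabla E_{\text{in}}(w_t)}_{\text{known}}$

at iteration t, to find $g_t$, solve

$\min_h \widehat{E}_{\text{ADA}} = \frac{1}{N} \sum_{n=1}^{N} \exp\Bigl(-y_n \Bigl(\sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(x_n) + \eta h(x_n)\Bigr)\Bigr)$
$= \sum_{n=1}^{N} u_n^{(t)} \exp(-y_n \eta h(x_n))$
$\overset{\text{taylor}}{\approx} \sum_{n=1}^{N} u_n^{(t)} (1 - y_n \eta h(x_n)) = \sum_{n=1}^{N} u_n^{(t)} - \eta \sum_{n=1}^{N} u_n^{(t)} y_n h(x_n)$

good h: minimize $\sum_{n=1}^{N} u_n^{(t)} (-y_n h(x_n))$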
Learning Hypothesis as Optimization

finding good h (function direction) $\Leftrightarrow$ minimize $\sum_{n=1}^{N} u_n^{(t)} (-y_n h(x_n))$

for binary classification, where $y_n$ and $h(x_n)$ both $\in \{-1, +1\}$:

$\sum_{n=1}^{N} u_n^{(t)} (-y_n h(x_n)) = \sum_{n=1}^{N} u_n^{(t)} \cdot \begin{cases} -1 & \text{if } y_n = h(x_n) \\ +1 & \text{if } y_n \ne h(x_n) \end{cases}$
$= -\sum_{n=1}^{N} u_n^{(t)} + \sum_{n=1}^{N} u_n^{(t)} \cdot \begin{cases} 0 & \text{if } y_n = h(x_n) \\ 2 & \text{if } y_n \ne h(x_n) \end{cases}$
$= -\sum_{n=1}^{N} u_n^{(t)} + 2 E_{\text{in}}^{u^{(t)}}(h) \cdot N$

who minimizes $E_{\text{in}}^{u^{(t)}}(h)$? $\mathcal{A}$ in AdaBoost! :-)

$\mathcal{A}$: good $g_t = h$ for ‘gradient descent’
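The rewriting of the objective in terms of the weighted error can be verified on random data. The values below are hypothetical; the weighted error is computed with the 1/N factor, which is why the factor N reappears on the right-hand side.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 8
u = rng.random(N)                          # hypothetical weights u_n^(t)
y = rng.choice([-1, 1], size=N)
h = rng.choice([-1, 1], size=N)            # hypothetical predictions h(x_n)

lhs = np.sum(u * (-y * h))                 # sum_n u_n (-y_n h(x_n))
E_in_u = np.sum(u * (y != h)) / N          # weighted error with the 1/N factor
rhs = -np.sum(u) + 2.0 * E_in_u * N
print(lhs, rhs)
```

Since the sum of the weights does not depend on h, minimizing the left-hand side is the same as minimizing the weighted error.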
Deciding Blending Weight as Optimization

AdaBoost finds $g_t$ by approximately $\min_h \widehat{E}_{\text{ADA}} = \sum_{n=1}^{N} u_n^{(t)} \exp(-y_n \eta h(x_n))$

after finding $g_t$, how about $\min_\eta \widehat{E}_{\text{ADA}} = \sum_{n=1}^{N} u_n^{(t)} \exp(-y_n \eta g_t(x_n))$

• optimal $\eta_t$ somewhat ‘greedily faster’ than fixed (small) $\eta$, called steepest descent for optimization
• two cases inside summation:
  • $y_n = g_t(x_n)$: $u_n^{(t)} \exp(-\eta)$ (correct)
  • $y_n \ne g_t(x_n)$: $u_n^{(t)} \exp(+\eta)$ (incorrect)
• $\widehat{E}_{\text{ADA}} = \Bigl(\sum_{n=1}^{N} u_n^{(t)}\Bigr) \cdot \bigl((1 - \epsilon_t) \exp(-\eta) + \epsilon_t \exp(+\eta)\bigr)$

by solving $\frac{\partial \widehat{E}_{\text{ADA}}}{\partial \eta} = 0$, steepest $\eta_t = \ln \sqrt{\frac{1-\epsilon_t}{\epsilon_t}} = \alpha_t$, remember? :-)

AdaBoost: steepest descent with approximate functional gradient
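The closed-form step size can be compared against a brute-force search over $\eta$. This is a rough numerical sketch; the error rate 0.2, the grid, and the function name `e_ada` are arbitrary choices made for illustration.

```python
import numpy as np

def e_ada(eta, eps, u_sum=1.0):
    """E_ADA as a function of eta for weighted error rate eps."""
    return u_sum * ((1.0 - eps) * np.exp(-eta) + eps * np.exp(eta))

eps = 0.2
etas = np.linspace(0.0, 2.0, 200001)               # grid with spacing 1e-5
eta_grid = etas[np.argmin(e_ada(etas, eps))]       # brute-force minimizer
eta_closed = np.log(np.sqrt((1.0 - eps) / eps))    # ln sqrt((1 - eps) / eps)
print(eta_grid, eta_closed)
```

Both values agree up to the grid resolution, which matches the claim that the steepest step equals the AdaBoost vote.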
Fun Time

With $\widehat{E}_{\text{ADA}} = \Bigl(\sum_{n=1}^{N} u_n^{(t)}\Bigr) \cdot \bigl((1 - \epsilon_t) \exp(-\eta) + \epsilon_t \exp(+\eta)\bigr)$, which of the following is $\frac{\partial \widehat{E}_{\text{ADA}}}{\partial \eta}$ that can be used for solving the optimal $\eta_t$?








1. $\Bigl(\sum_{n=1}^{N} u_n^{(t)}\Bigr) \cdot \bigl(+(1 - \epsilon_t) \exp(-\eta) + \epsilon_t \exp(+\eta)\bigr)$
2. $\Bigl(\sum_{n=1}^{N} u_n^{(t)}\Bigr) \cdot \bigl(+(1 - \epsilon_t) \exp(-\eta) - \epsilon_t \exp(+\eta)\bigr)$
3. $\Bigl(\sum_{n=1}^{N} u_n^{(t)}\Bigr) \cdot \bigl(-(1 - \epsilon_t) \exp(-\eta) + \epsilon_t \exp(+\eta)\bigr)$
4. $\Bigl(\sum_{n=1}^{N} u_n^{(t)}\Bigr) \cdot \bigl(-(1 - \epsilon_t) \exp(-\eta) - \epsilon_t \exp(+\eta)\bigr)$

Reference Answer: 3

Differentiate $\exp(-\eta)$ and $\exp(+\eta)$ with respect to $\eta$ and you should easily get the result.
Gradient Boosting

Gradient Boosting for Arbitrary Error Function

AdaBoost
$\min_\eta \min_h \frac{1}{N} \sum_{n=1}^{N} \exp\Bigl(-y_n \Bigl(\sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(x_n) + \eta h(x_n)\Bigr)\Bigr)$
with binary-output hypothesis h

GradientBoost
$\min_\eta \min_h \frac{1}{N} \sum_{n=1}^{N} \text{err}\Bigl(\sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(x_n) + \eta h(x_n),\; y_n\Bigr)$
with any hypothesis h (usually real-output hypothesis)

GradientBoost: allows extension to different err for regression/soft classification/etc.
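GradientBoost for Regression

with $\text{err}(s, y) = (s - y)^2$ and $s_n = \sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(x_n)$

$\min_h \ldots \overset{\text{taylor}}{\approx} \min_h \frac{1}{N} \sum_{n=1}^{N} \underbrace{\text{err}(s_n, y_n)}_{\text{constant}} + \frac{1}{N} \sum_{n=1}^{N} \eta h(x_n) \left. \frac{\partial\, \text{err}(s, y_n)}{\partial s} \right|_{s = s_n}$

$= \min_h \text{constants} + \frac{\eta}{N} \sum_{n=1}^{N} h(x_n) \cdot 2 (s_n - y_n)$

naive solution $h(x_n) = -\infty \cdot (s_n - y_n)$ if no constraint on h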








Learning Hypothesis as Optimization

$\min_h \text{constants} + \frac{\eta}{N} \sum_{n=1}^{N} 2 h(x_n)(s_n - y_n)$

• magnitude of h does not matter: because $\eta$ will be optimized next
• penalize large magnitude to avoid naive solution

$\min_h \text{constants} + \frac{\eta}{N} \sum_{n=1}^{N} \bigl(2 h(x_n)(s_n - y_n) + (h(x_n))^2\bigr)$
$= \text{constants} + \frac{\eta}{N} \sum_{n=1}^{N} \bigl(\text{constant} + (h(x_n) - (y_n - s_n))^2\bigr)$

• solution of penalized approximate functional gradient: squared-error regression on $\{(x_n, y_n - s_n)\}$, where $y_n - s_n$ is the residual

GradientBoost for regression: find $g_t = h$ by regression with residuals
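The step from the penalized objective to the squared-error form is a completion of the square. Written out per example (a worked step added for clarity, using only the symbols defined above):

```latex
\begin{align*}
2h(x_n)(s_n - y_n) + (h(x_n))^2
  &= (h(x_n))^2 - 2h(x_n)(y_n - s_n) + (y_n - s_n)^2 - (y_n - s_n)^2 \\
  &= \bigl(h(x_n) - (y_n - s_n)\bigr)^2 - (y_n - s_n)^2 .
\end{align*}
```

The trailing term $-(y_n - s_n)^2$ does not depend on h, so it plays the role of the per-example constant, and minimizing over h reduces to squared-error regression on the residuals.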


Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 16/25 


Deciding Blending Weight as Optimization

after finding $g_t = h$,

$\min_\eta \frac{1}{N} \sum_{n=1}^{N} \text{err}\Bigl(\underbrace{\sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(x_n)}_{s_n} + \eta g_t(x_n),\; y_n\Bigr)$ with $\text{err}(s, y) = (s - y)^2$

$\min_\eta \frac{1}{N} \sum_{n=1}^{N} (s_n + \eta g_t(x_n) - y_n)^2 = \frac{1}{N} \sum_{n=1}^{N} \bigl((y_n - s_n) - \eta g_t(x_n)\bigr)^2$

one-variable linear regression on {($g_t$-transformed input, residual)}

GradientBoost for regression: $\alpha_t$ = optimal $\eta$ by $g_t$-transformed linear regression


Putting Everything Together

Gradient Boosted Decision Tree (GBDT)
$s_1 = s_2 = \ldots = s_N = 0$
for t = 1, 2, ..., T
  1. obtain $g_t$ by $\mathcal{A}(\{(x_n, y_n - s_n)\})$ where $\mathcal{A}$ is a (squared-error) regression algorithm
     (how about sampled and pruned C&RT?)
  2. compute $\alpha_t$ = OneVarLinearRegression($\{(g_t(x_n), y_n - s_n)\}$)
  3. update $s_n \leftarrow s_n + \alpha_t g_t(x_n)$
return $G(x) = \sum_{t=1}^{T} \alpha_t g_t(x)$

GBDT: ‘regression sibling’ of AdaBoost-DTree, popular in practice
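The three steps can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not the lecture's implementation: inputs are one-dimensional, the regression algorithm is a height-1 regression tree (standing in for a sampled and pruned C&RT), no sampling is done, and the names `fit_regression_stump`, `gbdt_fit`, and `gbdt_predict` are hypothetical.

```python
import numpy as np

def fit_regression_stump(x, r):
    """Height-1 regression tree on 1-D inputs: one threshold, two leaf means."""
    xs = np.unique(x)
    best = (np.inf, xs[0] - 1.0, r.mean(), r.mean())
    for theta in (xs[:-1] + xs[1:]) / 2.0:
        left, right = r[x <= theta], r[x > theta]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best[0]:
            best = (sse, theta, left.mean(), right.mean())
    _, theta, lo, hi = best
    return lambda z: np.where(z <= theta, lo, hi)

def gbdt_fit(x, y, T):
    s = np.zeros_like(y, dtype=float)          # s_1 = ... = s_N = 0
    model = []
    for _ in range(T):
        residual = y - s
        g = fit_regression_stump(x, residual)  # step 1: regression on residuals
        gx = g(x)
        denom = np.dot(gx, gx)
        # step 2: one-variable linear regression of residual on g_t(x_n)
        alpha = np.dot(gx, residual) / denom if denom > 0 else 0.0
        s = s + alpha * gx                     # step 3: update the scores
        model.append((alpha, g))
    return model

def gbdt_predict(model, z):
    return sum(alpha * g(z) for alpha, g in model)

x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x)
mse = [np.mean((gbdt_predict(gbdt_fit(x, y, T), x) - y) ** 2) for T in (1, 5, 25)]
print(mse)                                     # training error for T = 1, 5, 25
```

On this toy target the training error shrinks as more stumps are added, since each round fits what the previous rounds left over.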


















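Fun Time

Which of the following is the optimal $\eta$ for

$\min_\eta \frac{1}{N} \sum_{n=1}^{N} \bigl((y_n - s_n) - \eta g_t(x_n)\bigr)^2$

1. $\Bigl(\sum_{n=1}^{N} g_t(x_n)(y_n - s_n)\Bigr) \cdot \Bigl(\sum_{n=1}^{N} g_t^2(x_n)\Bigr)$
2. $\Bigl(\sum_{n=1}^{N} g_t(x_n)(y_n - s_n)\Bigr) / \Bigl(\sum_{n=1}^{N} g_t^2(x_n)\Bigr)$
3. $\Bigl(\sum_{n=1}^{N} g_t(x_n)(y_n - s_n)\Bigr) + \Bigl(\sum_{n=1}^{N} g_t^2(x_n)\Bigr)$
4. $\Bigl(\sum_{n=1}^{N} g_t(x_n)(y_n - s_n)\Bigr) - \Bigl(\sum_{n=1}^{N} g_t^2(x_n)\Bigr)$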













Reference Answer: 2

Derived within Lecture 9 of ML Foundations, remember? :-)
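The one-variable regression has a closed form: setting the derivative of the squared error with respect to $\eta$ to zero gives the ratio of two sums. A small sketch with made-up numbers follows; the function name and data are hypothetical.

```python
import numpy as np

def one_var_linear_regression(g, r):
    """Minimize sum_n (r_n - eta * g_n)^2 over the single scalar eta."""
    return float(np.dot(g, r) / np.dot(g, g))

def squared_error(eta, g, r):
    return float(np.mean((r - eta * g) ** 2))

g = np.array([1.0, 2.0, 3.0])              # hypothetical g_t(x_n)
r = np.array([2.0, 4.0, 6.0])              # hypothetical residuals y_n - s_n (exactly 2 * g)
eta = one_var_linear_regression(g, r)
print(eta, squared_error(eta, g, r))
```

Moving $\eta$ away from this value in either direction increases the squared error.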
Summary of Aggregation Models

Map of Blending Models

blending: aggregate after getting diverse $g_t$

uniform:      simple voting/averaging of $g_t$
non-uniform:  linear model on $g_t$-transformed inputs
conditional:  nonlinear model on $g_t$-transformed inputs

uniform for ‘stability’; non-uniform/conditional carefully for ‘complexity’





Map of Aggregation-Learning Models

learning: aggregate as well as getting diverse $g_t$

Bagging:        diverse $g_t$ by bootstrapping;     uniform vote by nothing :-)
AdaBoost:       diverse $g_t$ by reweighting;       linear vote by steepest search
Decision Tree:  diverse $g_t$ by data splitting;    conditional vote by branching
GradientBoost:  diverse $g_t$ by residual fitting;  linear vote by steepest search

boosting-like algorithms most popular


Map of Aggregation of Aggregation Models

Bagging   AdaBoost   Decision Tree   GradientBoost

Random Forest:   randomized bagging + ‘strong’ DTree
AdaBoost-DTree:  AdaBoost + ‘weak’ DTree
GBDT:            GradientBoost + ‘weak’ DTree

all three frequently used in practice


Specialty of Aggregation Models

cure overfitting
• G(x) ‘moderate’
• aggregation $\Rightarrow$ regularization

cure underfitting
• G(x) ‘strong’
• aggregation $\Rightarrow$ feature transform

proper aggregation (a.k.a. ‘ensemble’) $\Rightarrow$ better performance
Fun Time

Which of the following aggregation models learns diverse $g_t$ by reweighting and calculates linear vote by steepest search?

1. AdaBoost
2. Random Forest
3. Decision Tree
4. Linear Blending















Reference Answer: 1

Congratulations on being an expert in aggregation models! :-)
Summary

1. Embedding Numerous Features: Kernel Models
2. Combining Predictive Features: Aggregation Models

   Lecture 11: Gradient Boosted Decision Tree
   • Adaptive Boosted Decision Tree
     sampling and pruning for ‘weak’ trees
   • Optimization View of AdaBoost
     functional gradient descent on exponential error
   • Gradient Boosting
     iterative steepest residual fitting
   • Summary of Aggregation Models
     some cure underfitting; some cure overfitting

3. Distilling Implicit Features: Extraction Models
   • next: extract features other than hypotheses





